SCONCE2: jointly inferring single cell copy number profiles and tumor evolutionary distances

Background Single cell whole genome tumor sequencing can yield novel insights into the evolutionary history of somatic copy number alterations. Existing single cell copy number calling methods do not explicitly model the shared evolutionary process of multiple cells, and generally analyze cells independently. Additionally, existing methods for estimating tumor cell phylogenies using copy number profiles are sensitive to profile estimation errors. Results We present SCONCE2, a method for jointly calling copy number alterations and estimating pairwise distances for single cell sequencing data. Using simulations, we show that SCONCE2 has higher accuracy in copy number calling and phylogeny estimation than competing methods. We apply SCONCE2 to previously published single cell sequencing data to illustrate the utility of the method. Conclusions SCONCE2 jointly estimates copy number profiles and a distance metric for inferring tumor phylogenies in single cell whole genome tumor sequencing across multiple cells, enabling deeper understandings of tumor evolution. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04890-w.

phylogenetic relationship among different cell types. A challenge in such efforts is that single cell sequencing data is typically very noisy due to variable and low sequencing depth [5], making accurate genotyping, copy number (CN) calling, and phylogeny estimation difficult. However, as we will show here, by leveraging the shared evolutionary history among cells, jointly calling CNAs across cells can lead to increased accuracy and give information about the evolutionary relationship between cells, thereby leading to improved estimates of tumor phylogenies. Different cells from the same tumor share some of their somatic evolutionary history, and information regarding CNAs, and CNA breakpoints, from one cell can, therefore, inform CNA calling in other cells.
Unfortunately, the commonly used methods for estimating single cell copy number profiles (CNPs), the collection of copy number states across the genome, do not rigorously use this shared information. Instead, most methods, including SCONCE [6] and the commonly-used AneuFinder [7,8], independently call CNPs. Although SCONCE [6], a copy number calling method for single cell tumor data, was previously shown to outperform competing methods in absolute copy number and breakpoint detection accuracy [6], it does not utilize any information from shared evolutionary histories between cells. Other methods, such as CopyNumber [9] and SCIcONE [10], jointly call CNPs by forcing breakpoints to be shared across all cells. However, we showed in previous work that SCONCE [6] outperforms these methods as well, despite not analyzing cells jointly. In contrast, WaveDec [11], a method designed to detect shared and cell specific copy number events in copy number arrays and applied to a subset of sequencing data from [12], takes an orthogonal approach by transforming log 2 -ratio copy number data into the wavelet space. This transformation allows separation of common/shared and individual CNAs, as shared events are captured by the approximation coefficients and individual events are described by the detail coefficients.
Despite these limitations in copy number calling, several distance metrics for copy number profiles have been developed for estimating tumor phylogenies using algorithms such as neighbor-joining [13,14]. Commonly used pairwise distance metrics include the Euclidean distance [12,15], the MEDICC distance described by [16], and the cnp2cnp distance presented by [17]. Although the Euclidean distance is easy to calculate, large and/or overlapping CNAs can artificially inflate this measure, leading to overestimation of dissimilarity. The latter two methods measure distance between two CNPs by attempting to find the minimum number of deletion and amplification events needed to transform one CNP into the other, without allowing regions that are lost to be regained. The MEDICC model is limited to maximum copy number 4 and events that increase or decrease copy number by one, while the cnp2cnp metric relaxes both of these constraints. Cordonnier et al. [17] re-implemented the MEDICC algorithm to allow copy numbers greater than 4, and showed that while both the cnp2cnp and MEDICC distances outperform the Euclidean distance for the purpose of phylogeny estimation, cnp2cnp is more accurate on error free data and MEDICC is more accurate on data with errors.
However, none of these methods use explicit evolutionary models of CNAs to provide joint estimates of CNPs and evolutionary distance. Here, we present SCONCE2, an expansion on SCONCE, that further develops SCONCE's underlying tumor evolutionary model to jointly model the CNA process in two cells. SCONCE2 takes advantage of the shared evolutionary history between cells, and produces more accurate single cell CNP estimates and pairwise estimates of the evolutionary distances between cells, by combining information across multiple cells. We show that SCONCE2 estimates more accurate CNPs and tumor phylogenies than competing methods using extensive simulations, and apply it to previously published data from [12,18] to illustrate its utility.

Results
To infer the evolutionary history of tumor cells, SCONCE2 models the evolution of pairs of cells. We assume a pair of cells, (A, B), have a partially shared evolutionary history originating from a healthy ancestral diploid cell, D. The shared part of their evolutionary history is represented in a tree, T = [t 1 , t 2 , t 3 ] , by a branch of length t 1 , running from an non-tumor diploid cell (D) to an unobserved divergence point, Z. From Z, cells A and B evolve independently, with branch lengths t 2 and t 3 , respectively (see Fig. 1). A core goal is to estimate this tree and to distinguish between shared evolutionary events and independent cell specific events.
Because the number of pairs of cells grows quadratically with the number of cells, n, full joint maximum likelihood estimation of all parameters can become computationally challenging. We, therefore, first run SCONCE on all cells independently to obtain cell specific estimates of model parameters. We then take the median of the estimates of evolutionary parameters {α, β, γ } , corresponding to the rates of different types of copy number events (see One cell continuous time Markov process), to combine the disjoint SCONCE estimates into summary estimates across all cells. Then, for each pair of cells, we estimate branch lengths of tree T = [t 1 , t 2 , t 3 ] using maximum likelihood, and use the Viterbi algorithm to calculate paired decoded copy number profiles. Because each cell appears in n − 1 pairs, this produces n − 1 paired CNP estimates per cell. Finally, for each cell, we take the per window mean across each cell's n − 1 paired CNP estimates to calculate consensus CNPs. This pipeline is described in Detailed SCONCE2 pipeline and illustrated in Fig. 2.
By analyzing each cell in the context of multiple pairs, we obtain increased accuracy in copy number calls and breakpoint detection, as well as usable tree branch length estimates. We examine the properties of these estimates on both simulated and real data. , between the pair of cells A and B, where the branch with length t 1 represents their shared evolutionary history from an ancestral diploid cell, D, before diverging at the unobserved state Z. The branches with lengths t 2 and t 3 show independent evolution to cells A and B, respectively. Copy number in adjacent bins along the genome is shown in parentheses

Simulations
In order to rigorously test SCONCE2, we applied it to four simulated datasets and two real datasets, from [12,18]. We simulated 128 cells on four different tree structures: tree A) is maximally imbalanced and ultrametric, tree B) is perfectly balanced and ultrametric, tree C) is maximally imbalanced and not ultrametric, where internal and terminal branches have uniform length, and tree D) is maximally imbalanced and not ultrametric, where internal branches have equal length and terminal branch lengths decay logarithmically (tree structures shown in Additional file 1: Fig. S1). Simulated cells from each tree structure were divided into five discrete test subsets of 20 cells each.
Briefly, the simulated genome is modeled as a collection of line segments, where amplifications and deletions occur according to a Markov process and have lengths sampled from a truncated exponential distribution. Copy number events occur within the tree structure, such that ancestral CNAs are propagated to descendent cells. Note, the simulation model is more biologically realistic and intentionally structured to be substantially different from the SCONCE2 inference model, in order to avoid biasing accuracy results to favor our method. We previously described this simulation model in [6], and full simulation details are given in Simulations.

Fig. 2
Detailed flowchart of the SCONCE2 pipeline. We demonstrate the pipeline with cell triplet {A, B, C} , without loss of generality. Each tumor cell is initially independently analyzed through SCONCE, which gives parameter estimates and copy number profiles for each cell (red box). These parameter estimates are then summarized (yellow box), and branch lengths for tree T = [t1, t 2 , t 3] are estimated for each pair of cells (green box). Finally, for each cell, paired copy number profiles are summarized into a consensus copy number profile (blue box). Each step is fully described in Detailed SCONCE2 pipeline

Sum of squared error on CNPs
To measure copy number accuracy, we calculated the sum of squared errors (SSE) between the inferred copy number and the true simulated copy number across genomic windows for each cell. To evaluate each step in the SCONCE2 pipeline, we calculated the SSE on copy number profiles generated from individual cell estimation (SCONCE), on profiles from each pair of cells (one pair), and on consensus profiles estimated using three different summary statistics (mean, median, mode) across multiple pairs of cells.
We also compared to AneuFinder [7,8], a commonly used method for single cell copy number calling, and the second-most accurate one, after SCONCE, among methods evaluated in previous work [6]. In all subsequent results, we report summary statistics across all subsets for each tree/simulation set. Recall full simulation descriptions are given in Simulations.
In tree A (maximally imbalanced ultrametric tree; Fig. 3A Fig. 3 Boxplots of genome wide sum of squared error (SSE) between true simulated copy number profiles and inferred copy number profiles, across methods. SSE results are shown for cell specific CNPs from SCONCE (independent cell inference); joint inference on each pair of cells (one pair); summary functions across all pairs of cells (mean, median, and mode); and AneuFinder. Different methods are shown across the x-axis, and SSE is shown on the y-axis. Median SSE for each method is printed at the top of each column. Each dot represents one cell (note, in "one pair", each cell appears multiple times), and the median SSE is printed at the top of each column. Panel letters correspond to tree labels A (maximally imbalanced ultrametric tree), B (perfectly balanced ultrametric tree), C (maximally imbalanced non ultrametric tree with uniform branch lengths), and D (maximally imbalanced non ultrametric tree with logarithmically decaying branch lengths  Fig. 3C, D). Clearly, there is a substantial improvement in accuracy by using pairs of cells instead of individual cells, and this improvement in accuracy is larger if multiple pairs are used. We note that, as an artifact of the genome binning procedure, true fractional copy numbers may occur from small CNAs completely contained within window boundaries, or from CNAs crossing window boundaries (for example, observing windows with true copy numbers 1 → 1.25 → 2 ). As such, the mean and median have the lowest SSE values because they allow fractional copy numbers. However, many downstream tools expect integer copy number profiles for single cells, so users may wish to round to the nearest integer or use the mode option.

Breakpoint distance and detection
In order to measure breakpoint detection accuracy, we calculated the genome wide distance between inferred and true breakpoints, penalized by the number of total inferred breakpoints. Specifically, for each simulated breakpoint, we calculated the distance to the nearest inferred breakpoint. Because erroneously inferring breakpoints at every position in the genome would artificially lower this genome wide distance, we also calculated ω = # inferred breakpoints # true breakpoints , such that lowest breakpoint distances with ω values closest to 1 indicate greatest accuracy.
In all simulation sets, using the mean consistently had ω values closest to 1, again due to fractional copy number states, as well as lower total breakpoint distance than other methods. Across trees, results from AneuFinder, followed by SCONCE, had the highest breakpoint distances and ω values further from 1.  Tables S2 and S3. Similarly to the observations for the CNP estimates, breakpoint detection also improves when using pairs of cells, and improves when estimates from multiple pairs are combined, particularly if combining using the mean. As previously noted in Sum of Squared Error on CNPs, binning the genome can result in fractional copy numbers for some bins. Compared to the median and the mode, the mean is better able to capture these fractional copy number states, resulting in lower breakpoint distances and ω values closer to 1.

Optimal number of pairs to use
Because there are n 2 pairs for n cells, averaging over more pairs of cells comes at a computational cost. Furthermore, as we will show, adding too many divergent cells can reduce the accuracy, as including highly divergent cells in the average may increase the noise.
To determine the optimal number of cells to summarize across, we estimated summarized (mean) copy number profiles with increasing numbers of cells. As each cell was added, we calculated the difference in SSE relative to SCONCE (individual cells). Cells were added in three different orderings: most to least similar (i.e., nearest first, as defined by the Euclidean distance between the cells' SCONCE profiles), least to most similar (furthest first), and randomized order. In tree A, the median pairwise Euclidean distance between SCONCE profiles was 88.0625 for the nearest/most similar cells, 138.9585 for the tenth most similar cells, and 205.853 for the least similar cells. Median pairwise Euclidean distances for all datasets are given in Additional file 1: Table S4.
In Fig. 5, we summarize the change in SSE across κ cells for each tree. Across all trees, SSE improves fastest when adding nearest cells first, and slowest for adding furthest cells first, with the random ordering in between. When adding nearest cells first, the SSE initially sharply decreases, levels off and reaches the largest decrease after approximately 10 cells, and then increases. Specifically, the mean change in SSE when adding nearest cells first reached the greatest decrease in SSE from SCONCE of -20.  Breakpoint detection accuracy results across methods. Total distance to nearest breakpoint is shown on the y-axis, and ω = # inferred breakpoints # true breakpoints is shown along the x-axis. Each dot represents one cell, colored by method. Methods with the highest breakpoint detection accuracy cluster near ω = 1 (vertical red dotted line) and have lowest total breakpoint distance. Each panel corresponds to a simulation set (A-D). In all simulation sets, using the mean, median, and mode have lower total breakpoint distance than independent cell analyses. Furthermore, using the mean results in ω values closest to 1, as it is able to infer fractional copy numbers When summarizing over κ < n − 1 cells, for a given cell, some of the other cells will not be used in that cell's consensus profiles. As a time saving measure, these excluded cell combinations are not analyzed. Therefore, we recommend users summarize over κ = 10 cells, added in order of most to least similar.
For completeness, SSE and breakpoint detection across parameter sets when summarized over only the nearest 10 cells is shown in Additional files 1: Figures S2, S3, and Tables S5 and S6.

Using multiple cells results in better CNA detection
Plotting true simulated copy number profiles against inferred copy number profiles shows why performance improves when using multiple cells. For example, in Fig. 6, SCONCE erroneously combined two breakpoints for cell A, while predicting cell B's breakpoint too far to the left (column labelled SCONCE). However, when analyzed as a pair, a shared breakpoint was inferred (left arrow), and the second breakpoint (right arrow) in cell A was correctly inferred (one pair column). While the shared breakpoint was closer to the true breakpoint, it was not until CNPs are summarized across multiple cells that the breakpoint was called in the correct position. Using the mean results in slightly fuzzier boundaries due to non-integer copy number calls (middle However, for this ordering, the decrease in SSE levels off after approximately 10 cells are added, and then slightly rises as more cells are added, due to rare copy number events getting averaged out column), which better reflected the true underlying data, while the median and mode (right two columns) result in integer jumps at bin boundaries. Similar results are observed for real data. For example, in Additional file 1: Figure S4, SCONCE missed the left most CNA in cell B (arrow in SCONCE column). When analyzed as a pair, this CNA was detected, and was shared with cell A (left arrows). Additionally, a short deletion was called in both cells (right arrows). However, this is a rare event, as it was averaged out in the mean, median, and mode analyses (arrows in mean, median, and mode columns). Furthermore, in Additional file 1: Figure S5, SCONCE did not call a CNA in cell A, but did call a -3 deletion in cell B (arrows in SCONCE column). However, when these two cells were jointly analyzed as a pair, there was enough evidence to call a -1 deletion in both. When summarizing across multiple cells, this deletion continued to be supported (right arrow). Additionally, there was some evidence from joint analyses with other cells that an additional small deletion existed in cell B, but not in cell A (left arrow). However, this small deletion was lost when using the median and mode, although the deletion first identified in the joint analysis of cells A and B remained.

Model parameter estimates
For each pair of cells, SCONCE2 estimates the branch lengths for tree T = [t 1 , t 2 , t 3 ] (see Fig. 1). From the simulated trees, the corresponding tree branch lengths and node distances can be extracted for each cell pair. Recall the simulation and inference models are intentionally formulated differently to evaluate SCONCE2 in more realistic settings (see Simulations). Because the scaling of T is different between the simulation and inference models, we show the R 2 values between true (simulated) and inferred values of , as well as the summed distance t 2 + t 3 as a distance metric between two cells.
We note that the sum t 2 + t3 has higher R 2 values than those of t 2 or t 3 individually, demonstrating some uncertainty in assigning events to particular branches. Furthermore, because the simulation model generates the number of CNAs from a distribution relating to a Poisson (however, the distribution is not truly Poisson as the size of the genome changes through the simulations), the mean and variance of the number of events increases with branch length in expectation. This increased variance is reflected by the larger range of branch length estimates as branch lengths increase. Nonetheless, as we will show in the next section, SCONCE2 recovers the magnitude of cell relationships sufficiently accurately to allow improved phylogeny estimation.

Phylogeny estimation
Estimating phylogenies on copy number profiles using neighbor-joining [13,14] requires a distance metric between cells. Existing metrics include the Euclidean distance [12], and two estimates of the minimum number of CNAs needed to transform one CNP into another: the cnp2cnp metric [17] and the MEDICC distance [16] (here, we use the implementation in the cnp2cnp program [17]). These methods require prior estimation of the CNP. See Running other methods for details on running these programs. Under the SCONCE2 model, by construction, t 2 + t 3 measures the pairwise distance between two cells. To compare these different distance metrics, we first calculated distance matrices using pairwise Euclidean, cnp2cnp, and MEDICC distances on CNPs called by SCONCE (previously showed to be more accurate than other single cell copy number callers [6]), as well pairwise t 2 + t 3 estimates. Next, we applied neighbor-joining to estimate phylogenies and computed the Robinson-Foulds (RF) distance [19] between the true trees and the inferred trees. As shown in Fig. 8, across parameter sets, the trees inferred from estimates of t 2 + t 3 had lower Robinson-Foulds distances than trees inferred from other distance metrics. For example, for tree A (ultrametric, maximally imbalanced), the median RF distances were 27, 27, 29, and 20 for the Euclidean distance, cnp2cnp distance, MEDICC distance, and t 2 + t 3 (Fig. 8), respectively (see Additional file 1: Table S7, for all median Robinson-Foulds distances).
Furthermore, we calculated RF distances from phylogenies based on the Euclidean, cnp2cnp, and MEDICC distances on consensus CNPs and true simulated CNPs (Additional file 1: Figure S6). When summarizing over all pairs of cells, using t 2 + t 3 consistently had lower median RF distances than other methods on consensus CNPs. For example, in tree A (ultrametric, maximally imbalanced), the median RF distances for phylogenies estimated from mean consensus profiles were 27, 26, and 28 for the Euclidean, cnp2cnp, and MEDICC distances (Additional file 1: Figure S6A). On distances calculated from the true CNPs, t 2 + t 3 performed as well or better than the other metrics, with the exception of the cnp2cnp distance in tree A, where the Euclidean, cnp2cnp, MEDICC distances and t 2 + t 3 had respective median RF distances of 27 Fig. 8 Robinson-Foulds (RF) distances between true and inferred phylogenies. Phylogenies were estimated using neighbor-joining on t 2 + t 3 estimates, and on the Euclidean, cnp2cnp, and MEDICC distances between true CNPs and CNPs inferred from SCONCE. Methods are shown along the x-axis, while Robinson-Foulds distances are shown on the y-axis. Across multiple parameter sets (panels A-D correspond to trees A-D), using t 2 + t 3 estimates resulted in a lower Robinson-Foulds distance from the simulated tree, relative to all other inferred phylogenies (Additional file 1: Figure S6A). However, under experimental conditions, the true CNP would be unknown. In all other simulation sets, the phylogenies estimated using t 2 + t 3 had lower median Robinson-Foulds distances than all other methods. For completeness, we additionally calculated Robinson-Foulds distances on phylogenies estimated from consensus CNPs from summarizing over the nearest 10 cells. For tree A (ultrametric, maximally imbalanced), the median RF distances on mean consensus CNPs over the nearest 10 cells were 27, 27, and 26 for the Euclidean, cnp2cnp, and MEDICC distances, respectively (Additional file 1: Figure S7A). Note that in order to estimate phylogenies using our t 2 + t 3 metric, all pairs of cells must be analyzed, and cannot benefit from the time savings of analyzing a selected subset of pairs of cells (see Optimal number of pairs to use). This is a weakness of our method, as analyzing all pairs of cells comes at an increased computational cost.

Discussion
We present a novel method, SCONCE2, that combines data across single cells in a manner that is grounded in a principled model of stochastic tumor evolution. It jointly calls copy number alterations in single cell sequencing of cancer cells with higher accuracy than competing methods on both simulated and real data. Additionally, SCONCE2 calculates an informative pairwise distance metric that can be used to estimate phylogenies with less error than other methods.
Similar to SCONCE, one weakness of SCONCE2 is the requirement for matched diploid cells in order to normalize GC content and mappability biases. These diploid cells must be sequenced under the same experimental conditions for proper GC content and mappability normalization, and may not be directly of interest to investigators, thereby potentially increasing cost. However, infiltrating diploid cells are often sequenced as a byproduct of single cell sequencing, and can be identified with orthogonal methods, such as cell sorting. For example, in the two datasets analyzed here, no additional sequencing was necessary to purposefully produce matched diploid sequencing data.
Additionally, SCONCE2 does not use SNPs or genotype likelihoods, or do any allelic phasing, to inform copy number calls or t 2 + t 3 estimates. Although calling SNPs in low coverage and noisy single cell data is difficult, incorporating genotype likelihoods can add information and increase confidence in these procedures. For example, using the allele frequency in variable single nucleotide sites can support concordant or rule out discordant copy number states. Furthermore, estimating the counts of variable sites on specific branches in T = [t 1 , t 2 , t 3 ] (see Fig. 1) can increase confidence in branch length estimates. Adding genotype likelihoods of single nucleotide variants is the subject of future work.
Another weakness of SCONCE2 is that it takes longer to run, relative to other methods. However, if investigators are primarily interested in copy number calling, significant time can be saved by summarizing over a selective subset of pairs of cells (that is, noninformative pairs are not analyzed), described in Optimal number of pairs to use. But, if investigators are interested in estimating phylogenies using our t 2 + t 3 metric, all pairwise distances must be estimated to calculate a complete distance matrix (described in Phylogeny Estimation), thereby negating this time saving measure. Because the distance matrix dimensions and number of pairwise comparisons grow quadratically with the number of cells analyzed, the computational complexity and run time cost grows quickly. However, if investigators are interested in both copy number calling and phylogeny estimation using our t 2 + t 3 metric, after all pairwise parameter estimates are calculated (the most computationally intensive step), investigators have the flexibility to quickly call consensus copy number profiles over an arbitrary number of pairs. Despite the computational complexity of this model, we propose the increased accuracy of both copy number calls and phylogeny estimation outweighs the increased computational run time cost.

Conclusions
In conclusion, we present a principled method, SCONCE2, for simultaneously and accurately calling and aggregating copy number profiles across multiple tumor cells, and estimating pairwise evolutionary distances, using single cell whole genome sequencing. This work shows jointly analyzing cells in single cell experiments to leverage their shared evolutionary history increases accuracy in copy number calling and phylogeny estimation, with implications for deepening our understanding of tumor evolution.

Evolutionary process modeling
We first review the Markov processes introduced in [6]. Briefly, we assume an evolutionary process that is continuous in time but discrete along the length of the genome. However, notice that this is just an approximation, as the true process along the length of the genome is not Markovian (see Simulations).

One cell continuous time Markov process
The one cell continuous time process from [6] models the copy numbers of two adjacent genomic bins, in positions i and i + 1 in the genome, on the same lineage (cell) with copy number U , V ∈ S c = {0, 1, . . . , k} , respectively, where k is the maximum allowed copy number. We assume that (U, V) evolve through time with the following rate parameters: which leads to the following instantaneous rate matrix for the joint process for two bins on one lineage:

rate of CNAs affecting both U and V
To ensure all rows sum to 0, we set the diagonal elements to the negative row sum, . Note that Q defines the instantaneous rate of events such as (U ′ , V ′ ) = (U + n, V − k), n > 0, k > 0 (i.e., events where (U, V) are changed by different CNAs) to be equal to 0. However, (U, V) can be changed by different CNAs for any evolutionary time interval t > 0 . Additionally, note we use the sum α + β as the rate for events with ±1 CNA, to allow ±1 copy number events to have a higher rate than larger magnitude copy number events.
The corresponding probability matrix, P , of time dependent transition probabilities of adjacent bins changes from (U, V) to (U ′ , V ′ ) is calculated as the matrix exponential where t is evolutionary time.

Two cell evolutionary process expansion
We now extend this single lineage process to describe the joint evolutionary process in two cells. Consider a pair of cells (A, B) and their most recent common ancestor in a tree, T = [t 1 , t 2 , t 3 ] , where t 1 denotes the branch length of their shared history and t 2 and t 3 denote the branch lengths from divergence, at unobserved state Z, to cells A and B, respectively (see Fig. 1).
Under this tree structure, adjacent bins in cells A and B have a shared evolutionary history for time t 1 from an ancestral diploid state (i.e., D : (2, 2)) to an intermediate unobserved state, Z : (W, Y), with associated transition probability P (2,2)

Approximating discrete Markov process along the genome
Next, we convert these continuous time process transition probabilities for adjacent bins in two cells into the transition probabilities for the approximating discrete Markov process for pairs of cells along the entire length of the genome, further described in Two Cell Hidden Markov Model Description. We do this by expanding the state space to the product space of the state space for each cell, S c . This expansion of the state space to the joint CN state for two cells is necessary as the correlation structure along the length of the genome prevents the use of standard tree-based dynamic programming algorithms such as Felsenstein's pruning algorithm [20].
The state space, S d , for this discrete process is composed of pairs (CN iA , CN iB ) , representing the copy numbers for window i in cell pair (A, B), where CN iA , CN iB ∈ S c = {0, 1, . . . , k} for a fixed maximal copy number k, such that (CN i+1,A ,CN i+1,B ) (T )} as the transition probability of moving from state (CN iA , CN iB ) in window i to (CN i+1,A , CN i+1,B ) in window i + 1 , given evolutionary tree T = [t 1 , t 2 , t 3 ] , for cell pair (A, B). Therefore, the matrix F(T ) is defined as which can be used to calculate a transition matrix, M(T ) , along the length of the genome for pairs of cells. This is done by dividing the joint probability of the CN state in both cells (A and B) in both bins (i and i + 1 ), with the marginal probability of CN state in both cells (A and B) in bin i, i.e., dividing each entry in F(T ) with the corresponding row sum: We have thereby constructed a process with state space on the copy numbers of pairs of cells, S d . The matrix M(T ) gives the probabilities of observing transitions from (CN iA , CN iB ) in window i to (CN i+1,A , CN i+1,B ) in window i + 1 , along the genome, for cell pair (A, B), given evolutionary tree T . We also note that the process along the length of the genome is not Markovian, as breakpoints appear in pairs, inducing an inherently non-Markovian correlation structure (see also [6]). However, to facilitate computation, we will approximate the process as a Markovian process with transition probabilities given by M(T ) . We note that while this model approximates the evolutionary process and paired nature of breakpoints via the genome wide transition matrix M(T ) , it does not explicitly model pairs of breakpoints jointly, potentially leading to unpaired breakpoints. This Markov chain will then be used for inferences in a Hidden Markov Model framework with emission probabilities similar to those described in [6].

Two cell hidden Markov model description
Expanding on the framework of [6], we define a Hidden Markov Model (HMM) [21][22][23] to infer copy number across the genome for pairs of tumor cells, using binned read depth data. {(0, 0), (0, 1), (0, 2), . . . , (k, k)} Recall that cells A and B are associated with the evolutionary tree, T , shown in Fig. 1. The sample space of read data, A , is composed of pairs of observed per window read count values, (x iA , x iB ) , x iA , x iB ∈ N 0 = {0, 1, 2, . . . }: This HMM uses the state space of paired copy numbers, S d , defined in Eq. 4 and the transition matrix M(T ) , defined in Eq. 6.

Emission probabilities
Assuming conditional independence between cells, the emission probabilities of the HMM are: As these probabilities are calculated similarly for cells A and B, we only describe the derivation for cell A (note: we previously described this derivation in [6]).
We assume X iA follows a negative binomial distribution, such that where Estimation of constants {a, b, c} is described in Initial independent parameter estimation using SCONCE. In the following, we will describe the full SCONCE2 estimation procedure in detail.

Detailed SCONCE2 pipeline
Given binned read depths for tumor and matched diploid cells, joint copy number calling in SCONCE2 takes place in four main steps: (1) independently estimating model parameters and copy number profiles for each cell using SCONCE, (2) combining independent parameter estimates across cells, (3) estimating tree branch lengths for each cell pair, and (4) creating summarized copy number profiles. This process is illustrated in Fig. 2.

Initial independent parameter estimation using SCONCE
We first estimate constants {a, b, c} , defined in Eqs. 10 and 11e, using maximum likelihood on diploid cells only, as previously described in [6]. We note that most single cell tumor sequencing projects naturally also produce data from non-tumor diploid cells as part of standard sequencing techniques, and that these cells conveniently can be used for standardization [12,[24][25][26][27][28][29][30].
In order to obtain initial estimates of all model parameters, we analyze all tumor cells independently through SCONCE, described in detail in [6] and briefly summarized here. This is done to avoid the computational cost of joint estimation for all model parameters across all pairs of cells. The SCONCE pipeline first estimates the transition matrix of an unconstrained CN HMM, with associated library size scaling factor s A , for each cell using a modified Baum-Welch [31] algorithm. These estimates are then used to obtain initial starting points for each model parameter for an optimization of the likelihood function using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [32]. This results in parameter estimates {ŝ A ,α A ,β A ,γ A ,t A } , for cell A. Recall s A is the library size scaling factor, defined as the coverage for the cell relative to the average diploid library size, {α, β, γ } are the instantaneous rates for copy number events, and t A is the total branch length from the ancestral diploid cell to cell A (see the red block in Fig. 2.).

Combining parameter estimates across multiple cells
To analyze shared evolutionary history between n cells, we first combine independent estimates across all cells of the transition rate parameters, {α, β, γ } , assumed to be shared among all cells, using the median to form joint estimates: We note that a full joint optimization could possibly lead to better model parameter estimates, especially for highly heterogeneous tumor populations, as the median is not strongly affected by extreme or highly variable individual estimates. However, we opted not to pursue the estimation of such estimates because of considerations of computational efficiency. This is illustrated in the yellow block in Fig. 2.

Estimating pairwise tree branch lengths
Next, we estimate parameters of the joint two-cell process for all n 2 pairs of cells. The branch lengths of tree T = [t 1 , t 2 , t 3 ] , are specific to each pair of cells, and branch length estimates from SCONCE, t , are used to inform the initial optimization starting point for T . For example, for pair (A, B), the initial branch length estimates, denoted with * , are: For each pair of cells, we use the BFGS algorithm to maximize the forward log likelihood in order to estimate T . To calculate the forward log likelihood of an observed sequence, the HMM is reset into the initial probability vector, defined as the steady state distribution, at the beginning of each chromosome to ensure chromosomal independence.
Because each set of branch lengths, [t 1 , t 2 , t 3 ] , is specific to each pair of cells, this procedure is trivially parallelizable (see the green block in Fig. 2).

Summarized copy number calling
After pairwise branch lengths are estimated, we use the Viterbi algorithm [33,34] to estimate the most likely joint copy number profile for each pair of cells. If cell A appears in n − 1 pairs, this results in n − 1 separate CNP estimates for cell A. In order to calculate a single consensus copy number profile, CN A,consensus , we use either the mean (default), median, or mode of the CN in each window among the n − 1 estimates.
While adding more information to each consensus copy number profile by summarizing across multiple cells initially increases accuracy, summarizing across too many divergent cells is not optimal because more accurate estimates about each cell are obtained using closely related cells than highly divergent cells (Fig. 5). Therefore, in order to balance combining data from multiple cells and maintaining cell specificity, the user can also choose to summarize across a subset of the κ nearest neighbors for each cell, instead of all n − 1 pairs a particular cell appears in. The nearest neighbors for each cell is defined by the Euclidean distance between individual copy number profiles from SCONCE. Then, the consensus copy number profile is calculated only across the κ selected pairs. Note, because of the genomic binning procedure, true copy number events may be split across bin boundaries or be completely contained within one bin, resulting in bins with non-integer average copy number. Using the mean and median summary functions can result in non-integer copy number calls, which more accurately represent the underlying biology as genomes are not truly organized in discrete bins. However, many downstream tools for single cell analyses require integer copy number profiles, so these values may need to be rounded for downstream analyses.

Simulations
In order to evaluate the accuracy of SCONCE2, we use the Line Segments model from SCONCE [6], which simulates copy number events on a fixed length reference genome as additions or deletions to a collection of line segments, and does not impose a maximum copy number limit. Note that although copy number events change the number and length of line segments, the reference genome length is constant. Additionally, copy number events create pairs of breakpoints at either end of the event, which are explicitly maintained in this simulation model, unlike the approximating discrete Markov process in the SCONCE2 inference model (see Approximating discrete Markov process along the genome), thereby making the simulation model more biologically realistic.
While we previously used neutral coalescent simulations [6]to define trees, we here instead adjust the tree structure and the length of the branch leading to the root (ie, time to first divergence event) to examine a range of highly different tree structures, each including 128 cells. We specify two ultrametric trees with uniform branch lengths of 1/128, where tree A is fully pectinate/maximally imbalanced and tree B is perfectly balanced, and two non ultrametric trees, where tree C has uniform internal and terminal branch lengths of 1/128, and tree D has uniform internal branch lengths of 1/128 and logarithmically decaying terminal branch lengths. These tree structures represent extremes in terms of how balanced the tree is and in terms of deviations from a molecular clock (ultrametric property). Following the definitions of [35], tree A models branching evolution, tree B models neutral evolution, and trees C and D model linear evolution. Under certain conditions, the structure of tree B can also be adjusted to model punctuated evolution [35] if the branch leading to the root is lengthened relative to the internal tree branches, such that more mutations fall on the shared ancestral/root branch compared to external branches. For illustration, the tree structure for 8 cells is shown for each dataset in Additional file 1: Figure S1.
For each tree, the total tree height (longest path from the root to a leaf ) was scaled to 1, and the branch leading to the root was set to length 1. Simulated reference genome lengths were set to 100, with amplification and deletion rates and expected lengths shown in Additional file 1: Table S1 (note that genomic length units are arbitrary, where expected copy number event lengths are defined relative to the genome length). As previously described in [6], the locations of copy number events follow a Markov process, and the lengths of copy number events follow a truncated exponential distribution.
To simulate read depths across the genome, the human reference genome was divided into 12,397 windows (equalling the number of 250 kb non overlapping uniform windows in hg19), and the number of reads falling into each window was simulated from a negative binomial distribution with parameter r = 50 . This results in files listing genomic window coordinates and number of reads observed in that window, for every simulated cell (similar to output from bedtools coverage [36] on real data). The total expected number of reads for each cell was set to 4,000,000 (322.7 expected reads per window) to approximate the observed number of reads per 250 kb window (mean 316.1, median 351.2) in diploid cells from [12]. Note the actual number of total observed reads in each cell is random.
In order to ensure tree B was a perfectly balanced binary tree and to be consistent between tree structures, read depths for 128 tumor cells and 100 diploid cells were simulated for each tree. Read depth across diploid cells was averaged per window for each tree. Tumor cells from each tree were divided into five non overlapping subsets of 20 cells to create test sets. Although healthy cells were shared for each analysis run, each test set was otherwise analyzed independently from other test sets from the same tree.
All parameter files used to generate simulations are available on GitHub, along with examples of simulated data.

Real data preprocessing
We applied SCONCE2 to two published single cell breast cancer datasets, from [12] and [18], a cancer type known for their frequent CNAs [37]. Both of these datasets were processed as previously described [6]. Briefly, for the [12] dataset, we trimmed reads using cutadapt [38] and trimmomatic [39], removed low complexity reads with prinseq [40], aligned reads to hg19 using bowtie2 [41], removed reads with q scores less than 20 using samtools [42], and removed PCR duplicates using picard [43]. For the [18] dataset, we split downloaded preprocessed bam files into cell specific bam files using pysam [44], and removed reads with q scores less than 20 using samtools [42]. Finally, we used bedtools coverage to count per window read depth for each cell [36]. Cells previously and orthogonally identified as diploid cells in [12] served as the matched normal. For the [18] dataset, cells from subset A were used as the diploid samples, as previously described [29].

Running other methods
For benchmarking, we limit our comparisons to other copy number only methods (that is, no SNP or phasing information is used): SCONCE [6] and AneuFinder [7,8] for copy number accuracy, and the cnp2cnp [17] and MEDICC [16] distances for phylogeny building.
Briefly, we ran AneuFinder with default parameters, with the exception of skipping GC and mappability corrections to avoid overcorrecting, as we did not include GC or mappability bias in our simulations. To benchmark SCONCE2's copy number calling, we first ran SCONCE [6] with default parameters (k=10). To run AneuFinder [7,8], we skipped the GC and mappability corrections steps to avoid over correcting, as our simulation model does not include GC or mappability biases. We directly ran Aneu-Finder's findCNVs function (default parameters: method="edivisive", R=10, sig.lvl=0.1). We extracted copy number calls from the resulting the copy.number element, and used bedtools intersect [36] to split large segments into 250 kb windows.
To evaluate SCONCE2's t 2 + t 3 distance metric in phylogeny estimation, we compared to the cnp2cnp distance [17] and the MEDICC distance [16]. To run cnp2cnp, we first converted and rounded called CNPs into fasta files, then ran cnp2cnp in matrix mode with default parameters (-m matrix -d any). Because the cnp2cnp metric depends on the input sample ordering and is not symmetric, we repeated this process on the reversed sample ordering, and summed the two resulting distance matrices to make a symmetric metric. To calculate the MEDICC distance, we used the ZZS implementation of the MEDICC algorithm in the cnp2cnp program to remove the maximum copy number limit of 4 in the original MEDICC software, and ran it on the same fasta files (-m matrix -d zzs). Full scripts to run other methods are provided on GitHub (runAneufinder.sh and runCnp2cnp.sh).

Phylogeny estimation and Robinson-Foulds distance calculations
To estimate phylogenies, distance matrices were first read into R [45]. Next, we applied neighbor-joining [13,14], implemented in the ape [46] package, to each distance matrix to estimate phylogenies.
To calculate Robinson-Foulds distances between inferred trees and true trees, we first used the read.tree function in the ape [46] package to read true trees in Newick format into R. Next, we used the treedist function from the phangorn [47,48]